[SPARK-12343][YARN] Simplify Yarn client and client argument #11603
jerryshao wants to merge 10 commits into apache:master
Conversation
Do we still need to support environment variables?
No. As I pointed out, the other JIRA, SPARK-3374, was proposing to remove them. There might be some internal env variables used just to communicate between the Client and the AM, but anything external can be removed for 2.x.
I think this could be done in SPARK-3374 as a separate patch to clean up all the internal env variables.
That would be OK, but I think you have already removed most of them; it looks like there are only a couple left in this file, so it might be just as easy to do it here.
Test build #52743 has finished for PR 11603 at commit
Hmm, it looks like this overlaps a lot with https://issues.apache.org/jira/browse/SPARK-3374, which is to deprecate the env variables, remove the deprecated configs, and remove the client args. Someone is actively working on that too, so I guess it kind of stinks to duplicate the work. Taking a quick skim, it seems like you did everything proposed there, so we might as well mark these as duplicates. Or actually, why don't you just update this PR to reference both JIRAs in the description?
Removing this and letting it use spark.jars switches to Spark's own addJar mechanism for distributing jars instead of the YARN distributed cache like before. I'm not sure I want to change that; I would rather keep things as consistent as possible using the YARN distributed cache unless we have a reason to change.
If this means we need to add another config like spark.yarn.dist.jars, I'm fine with that. We can leave it undocumented for now if we want, since spark.jars/spark.files isn't documented either, though I'm not quite sure why.
Yeah, we shouldn't lose the distributed cache functionality.
On that topic, but probably orthogonal to this change, at some point I'd like to see yarn-client somehow also use the distributed cache for these jars. IIRC that doesn't happen at the moment.
(Reading the rest of the change, if you figure out an easy way to have YARN's Client class handle spark.jars, you could solve both problems here... maybe some check in SparkContext to not distribute them in YARN mode?)
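For illustration, a hypothetical sketch of that check (speculative, not part of this PR; `master` and `jars` are existing SparkContext fields):

```scala
// Speculative sketch: let YARN's Client distribute spark.jars via the
// distributed cache, and skip Spark's own file-server distribution on YARN.
if (jars != null && !master.startsWith("yarn")) {
  jars.foreach(addJar)
}
```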
I see, thanks a lot for your suggestions.
Personally I'm fine with leaving some parameters in ClientArguments as long as they don't overlap with configs. The main thing is to not have to process the same config/setting multiple times. Here we either need the parameters or need to add new configs. Since these are all private, we can change it later if needed.
Note: I think the biggest thing here is to thoroughly test this to make sure all the various options work in both modes (cluster/client) with spark-submit, pyspark, spark-shell, and sparkr.
Do we need to pass these as explicit command-line arguments to the AM? Since we're simplifying code, we might as well take the chance and make the AM read these from the configuration too.
@vanzin, so you mean we should also simplify the command-line arguments for the AM?
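As a rough illustration of that suggestion (assuming typed config entries like the ones discussed below; not code from this PR):

```scala
// Sketch only: instead of parsing "--executor-memory" and "--executor-cores"
// in ApplicationMasterArguments, the AM reads them from SparkConf via typed
// config entries (entry names assumed).
val executorMemory = sparkConf.get(EXECUTOR_MEMORY)  // e.g. "1g"
val executorCores = sparkConf.get(EXECUTOR_CORES)    // e.g. 1
```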
Part of SPARK-12343 was to make
This is a core config, so it should be declared in core/src/main/scala/org/apache/spark/internal/config/package.scala instead. Same for EXECUTOR_MEMORY and PY_FILES, and probably also EXECUTOR_CORES, although I'm not completely sure about that last one.
I see, will change it.
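A minimal sketch of what that move could look like, following the ConfigBuilder pattern quoted later in this thread (the value types and defaults here are assumptions):

```scala
// In core/src/main/scala/org/apache/spark/internal/config/package.scala.
private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
  .stringConf
  .optional

private[spark] val EXECUTOR_MEMORY = ConfigBuilder("spark.executor.memory")
  .stringConf
  .optional

private[spark] val EXECUTOR_CORES = ConfigBuilder("spark.executor.cores")
  .intConf
  .withDefault(1)
```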
Here I put all the additional jars into a new configuration, spark.yarn.dist.jars; this is picked up by the YARN Client and put into the distributed cache. So now additional jars go into the distributed cache in both yarn-client and yarn-cluster mode.
One other question: do we also need to put the user jar into the distributed cache in yarn-client mode? I think it is doable; I'm just not sure whether there is any special concern.
I think we should just leave that as-is for now. We can file a separate JIRA if we want to change it.
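To illustrate the dist.jars flow described above (an assumed usage example, not code from this PR):

```scala
import org.apache.spark.SparkConf

// Jars listed in spark.yarn.dist.jars are picked up by YARN's Client and
// placed in the distributed cache in both yarn-client and yarn-cluster mode.
val conf = new SparkConf()
  .set("spark.yarn.dist.jars", "hdfs:///libs/extra.jar,file:///opt/libs/local.jar")
```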
So dist.files and dist.archives are public and documented; it seems like we should make dist.jars public and document it in the YARN docs as well, unless someone has a reason not to.
Sure, I will add it to the YARN docs.
Looks like there's already another config, spark.jars, that handles this; maybe we don't need to add another one, and we could keep dist.jars for internal, YARN-only use.
spark.jars is for distributing jars via Spark's internal mechanisms, whereas this goes through the distributed cache. We should add it to the YARN-only section of the docs, similar to dist.files and dist.archives.
Test build #52825 has finished for PR 11603 at commit
Test build #53412 has finished for PR 11603 at commit
Test build #53664 has finished for PR 11603 at commit
Test build #53751 has finished for PR 11603 at commit
We should remove the SPARK_LOG4J_CONF functionality in Client.scala. Also, have you looked at the test failures?
OK, I will remove this old one. It looks like the unit test failure is not related.
Another environment variable is
Test build #54025 has finished for PR 11603 at commit
We should file a separate JIRA for SPARK_JAVA_OPTS, since it's used in core and other places.
private[spark] val DRIVER_USER_CLASS_PATH_FIRST =
  ConfigBuilder("spark.driver.userClassPathFirst").booleanConf.withDefault(false)

private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
Mostly good, just a few comments.
Test build #54420 has finished for PR 11603 at commit
Test build #54499 has finished for PR 11603 at commit
(args.addJars, LocalResourceType.FILE, true),
(args.files, LocalResourceType.FILE, false),
(args.archives, LocalResourceType.ARCHIVE, false)
(sparkConf.get(JARS_TO_DISTRIBUTE).orNull, LocalResourceType.FILE, true),
minor: I think you could simplify this a little bit by listing just the options (instead of sparkConf.get(option).orNull); if you made the config entries lists defaulting to Nil (adding toSequence to their builders, and using withDefault instead of optional) you could save even more code below.
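A sketch of that suggestion, using the builder calls named in the comment (the FILES_TO_DISTRIBUTE and ARCHIVES_TO_DISTRIBUTE names are assumptions):

```scala
// Make the entry sequence-valued with an empty default...
private[spark] val JARS_TO_DISTRIBUTE = ConfigBuilder("spark.yarn.dist.jars")
  .stringConf
  .toSequence
  .withDefault(Nil)

// ...so the call site can list the options directly, without the
// sparkConf.get(option).orNull wrapping:
List(
  (JARS_TO_DISTRIBUTE, LocalResourceType.FILE, true),
  (FILES_TO_DISTRIBUTE, LocalResourceType.FILE, false),
  (ARCHIVES_TO_DISTRIBUTE, LocalResourceType.ARCHIVE, false)
)
```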
LGTM, I just left a really minor suggestion to save a few lines of code, but no need to address that right now.
Test build #54596 has finished for PR 11603 at commit
Merging to master, thanks!
What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf, environment variables, and command-line arguments, and some of these overlap, such as the client arguments and SparkConf. So here I propose to simplify the command-line arguments.
How was this patch tested?
This patch was tested manually and with the unit tests.
CC @vanzin @tgravescs, please help review this proposal. The original purpose of this JIRA is to remove ClientArguments; during the refactoring it turned out that some arguments like --class and --arg are not so easy to replace, so here I remove most of the command-line arguments and keep only the minimal set.
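For reference, a hedged sketch of that minimal set (field names beyond --class and --arg are assumptions):

```scala
// Sketch only: ClientArguments keeps just the options that don't map
// cleanly onto SparkConf entries.
private[spark] class ClientArguments(args: Array[String]) {
  var userClass: String = null      // --class
  var primaryPyFile: String = null  // --primary-py-file
  var userArgs: Seq[String] = Nil   // --arg (may be repeated)

  // parseArgs(args.toList) would populate the fields above.
}
```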